In the past two years, the coronavirus (COVID-19) pandemic spread and disrupted the entire world. The
only solution to this epidemic was health isolation, and with it everything stopped. When the availability
of a vaccine was announced, the world was divided over the effectiveness and harms of this vaccine. This
article provides an analysis of the number of vaccinated people and an analysis of people's opinions of
the vaccine's efficacy, whether negative or positive. Then a model is built to predict the future numbers
of vaccinated people, along with a model that predicts the number of negative opinions or tweets. The
model consists of three stages. The first stage converts the data sets into a synchronized time series,
that is, the same place and time for vaccinations and tweets. The second stage builds a prediction model.
The autoregressive integrated moving averages (ARIMA) method was used after decomposing the components
of the time series and choosing the optimal model; the best results were obtained from seasonal ARIMA
(SARIMA) for both predictions. The third stage is the descriptive analysis of the results and linking
them together to obtain an analysis describing the change in the number of vaccinated people and the
number of negative tweets.

Keywords: Machine learning, Predicting, Sentiment analysis, Vaccine

1. INTRODUCTION

The world is going through a transition; the continuing coronavirus (COVID-19) epidemic has changed how
day-to-day activities are handled, whether it is e-learning or the way people socialize, engage, use, or
store things [1]. At this time, it is essential to take the necessary precautions to protect oneself,
including washing one's hands often, wearing a mask in close quarters, and avoiding unnecessary physical
contact. However, these interventions only limit the further spread of COVID-19 rather than eliminating
it. Vaccination therefore came to play an essential role as the only method with any chance of controlling
and ultimately eliminating COVID-19 [2]. Extensive testing was carried out on the first mRNA vaccines to
be made available. Over 40,000 individuals participated in the Pfizer vaccine trial, while over 30,000
people participated in the vaccine trial conducted by Moderna [3]. The manufacture of various vaccines is
a significant challenge; nevertheless, people's striking lack of desire and motivation to get vaccinated
is even more disturbing and of considerable concern to health specialists interested in determining the
reasons behind this phenomenon. Since its inception, the vaccination campaign has been met with ambivalent
reactions from the general public; even within our own families, we have encountered disagreements or
questions over this topic [4]. People use social media platforms to share their thoughts, opinions, and
reactions to better manage and respond to extreme crises.

Social media platforms play an essential role in severe crises because they allow people to
collaborate on crisis management and response [5]. COVID-19 has been one of the trending topics on Twitter
since January 2020 and continues to be investigated to this day [6]. Twitter users can easily share the
sentiments expressed in their tweets with other people [7]. Predicting the number of people vaccinated
against COVID-19 and analyzing their feelings about the types of vaccines have been among the topics that
attracted researchers. A tweet represents its user's opinion, and sentiment analysis can be relied upon to
determine that opinion, which is an influential element in people's conviction of the quality of a product
[8]. Alam et al. [4] propose a technique for analyzing tweets.

The natural language processing (NLP) tool known as the valence aware dictionary and sentiment
reasoner (VADER) was used to investigate individuals' feelings towards every kind of vaccine. It was
helpful to conceptualize the entire situation by dividing the polarity of the sentiments into three
groups: positive, negative, and neutral. The findings showed that 33.96% of the responses were favourable,
17.55% were negative, and 48.49% were neutral [9]. The study also included a timeline analysis of the
tweets because respondents' emotions changed over time. When long short-term memory (LSTM) and
bidirectional LSTM (BiLSTM) recurrent neural networks (RNN) were used to test the performance of the
prediction models, the LSTM obtained an accuracy of 90.59% and the BiLSTM 90.83%. Other performance
measures, such as accuracy, F1 score, and confusion matrix, were also employed to verify the model and
the findings. This work contributes to the objective of eradicating COVID-19 worldwide by enhancing the
public's awareness of how COVID-19 may be prevented by vaccination. The study in [6] predicted the
popularity of tweets by analyzing public opinion and sentiment at different stages of the COVID-19
pandemic, from disease outbreak to vaccine distribution. Five sets of content features were extracted
from the tweets and applied to supervised machine learning algorithms: topic analysis, topics combined
with a term frequency-inverse document frequency (TF-IDF) vectorizer, a bag-of-words (BOW) TF-IDF
vectorizer, document embedding, and document embedding combined with a TF-IDF vectorizer.

According to the analysis, tweets with high emotional strength are more popular than tweets with
information about the COVID-19 pandemic. The author of [10] analyzes and forecasts the daily number of
confirmed cases of COVID-19 using a deep learning (DL) model and two statistical models: a long
short-term memory deep neural network (LSTM DNN), autoregressive integrated moving averages (ARIMA), and
ARIMA stacked with generalized autoregressive conditional heteroskedasticity (GARCH). The order of the
statistical models is identified with the autocorrelation and partial autocorrelation functions, and the
DL model hyperparameters, such as the number of LSTM cells and cell blocks, are found by exhaustive
search. The experiments use 10 data sets and examine how factors such as data size and vaccination affect
performance. The numerical findings demonstrate that performance depends on the data used. It is also
shown that the LSTM DNN can produce more accurate forecasts than the two statistical models. According to
the experiments' findings, the LSTM DNN can improve performance by up to 88.54% (86.63%) and 90.15%
(87.74%), respectively.

Ardabili et al. [11] analyzed and evaluated the capabilities of several machine learning models to
forecast the spread of COVID-19 in the United States of America, China, Iran, Germany, and Italy. The
results of models using an adaptive neuro-fuzzy inference system and a multilayer perceptron (MLP) showed
some encouraging signs. The study found that machine learning models are better than other approaches at
simulating the COVID-19 epidemic. A further recommendation in the study was to predict death rates in
order to anticipate the demand for critical care beds. Finally, the authors advocated combining machine
learning with susceptible-exposed-infectious-removed (SEIR) models to enhance the accuracy and timeliness
of typical epidemiological models. These models divide the population into susceptible, exposed,
infectious, and removed individuals. In recently published work [12], multiple supervised machine
learning approaches were employed to model COVID-19 infection in Mexico. Analyses were performed using a
variety of models, including support vector machine (SVM), logistic regression, decision trees, Naive
Bayes, and artificial neural networks. Correlation coefficients were used to examine how the dataset's
features are related to one another. The findings showed that the decision tree model had the highest
accuracy, at 94.99%. Naive Bayes' specificity was 94.30%, while the SVM's sensitivity was 93.34%. Seven
regression models were employed in [13] to forecast the number of infected Egyptians (exponential, logit,
quadratic, and third-, fourth-, fifth-, and sixth-degree). The models were trained using data from the
official database covering February 15 to June 15. These models were used to anticipate the progression
of COVID-19 and its final size and duration over horizons of 15 days and one month. The models were most
accurate when predicting 15 days ahead. On the other hand, the fourth-degree model showed predictive
power over one month. According to the logit growth regression model, the pandemic would reach its peak
on June 22, 2020, and end on September 8, 2020. In addition, a total of 166,760 cases was anticipated.
However, the presented findings could not be trusted entirely because of the existing social and
environmental (climatic) circumstances.


2. THEORETICAL BASIS FOR THE PROPOSED MODEL
2.1. Sentiment analysis 

The term "sentiment analysis" (SA) covers various activities, including extracting opinions,
classifying sentiment, classifying subjectivity, summarizing opinions, and detecting spam. SA aims to
study people's opinions, attitudes, and sentiments concerning things, people, problems, organizations,
services, and so on [14]. In the field of sentiment analysis, several methodologies, such as
lexicon-based methods and supervised machine learning, may assist us in determining how people feel about
something. Machine learning requires training data, which may be difficult to obtain. In addition, the
training process takes a significant amount of time and is computationally demanding in terms of CPU and
memory [15]. A sentiment lexicon is a set of linguistic features that are commonly categorized as
positive or negative according to their semantic orientation. The majority of research in sentiment
analysis makes use of preexisting, manually constructed lexicons, because building a lexicon is a
difficult task. LIWC, GI, and Hu-Liu04 are the most commonly used lexicons [16].


2.2. TextBlob 

TextBlob is a Python library for processing textual data. It provides a standard application
programming interface (API) for common NLP operations [9]. TextBlob objects behave like Python strings
[17]. It uses a sentiment lexicon and a sentiment analysis engine called pattern.en. Pattern.en analyzes
the text based on the English adjectives it contains, and WordNet is used for this purpose. Whenever
TextBlob performs sentiment analysis on a text, it produces a tuple of the form (polarity, subjectivity).
The polarity value is a float in the range [-1, 1] [16]. Because TextBlob objects behave much like Python
strings, the library is easy to use. In addition to tokenization and noun phrase extraction, TextBlob
provides sentiment analysis, part-of-speech tagging, language translation and detection, n-grams,
spelling correction, and WordNet integration [17].
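
As a minimal sketch of how the polarity score described above could be turned into the three classes used later in this article, the following maps a polarity value in [-1, 1] to a label. The function name `label_from_polarity` and the zero threshold are assumptions made for illustration, not the authors' stated rule; in practice the score would come from `TextBlob(text).sentiment.polarity`.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    The threshold is an assumption for illustration: scores above it are
    positive, scores below its negative are negative, the rest neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, e.g. as returned by
# TextBlob(text).sentiment.polarity for three tweets.
scores = [0.6, -0.25, 0.0]
labels = [label_from_polarity(s) for s in scores]
print(labels)
```

A non-zero threshold would widen the neutral band, which may be preferable when many tweets carry only weak sentiment.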


2.3. Time series data and prediction 

Time-series data are records of observations made on a specific subject over numerous periods. A
time series may be defined as a collection of observations made at uniformly spaced, consistent
intervals. Time series analysis uses data analysis methods to describe the data and derive meaningful
statistics from them. Time series forecasting uses a model to predict future values based on values that
have been observed in the past. The forecast is based only on historical data and assumes that the same
factors that affected the past and present will also affect the future [18]. If the future values of a
time series can be determined exactly from its previous observations, the time series is deterministic.
If the future values can only partly be determined by past values, the time series is stochastic
(random). ARMA and ARIMA are successful linear approaches; nevertheless, the predictive capacity of
linear models is constrained by the assumption of linear behavior in the underlying data [19].


2.4. ARIMA model 

The ARIMA model, also known as the Box-Jenkins model, is written ARIMA(p, d, q), where parameter p
denotes the order of autoregression, parameter d denotes the degree of differencing, and parameter q
denotes the order of the moving average. In the name ARIMA, "AR", "I", and "MA" stand for autoregressive,
integrated, and moving average, respectively [20]. ARIMA models for stationary time series have the
following mathematical representations [3]. Autoregressive model of order p, or AR(p) model:


$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t \tag{1}$$

Moving-average model of order q, or MA(q):

$$y_t = \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t \tag{2}$$

Autoregressive moving average model of order p and q, or ARMA(p, q):

$$y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t \tag{3}$$

where $\phi$ is the parameter for autoregression and $\theta$ is the parameter for moving average, $y_t$
represents the actual value at time t, and $c$ is the constant. The random disturbance term
$\varepsilon_t$ is considered white noise, with a mean of 0 and a constant variance [21]. In recent
years, the ARIMA model has emerged as one of the most used strategies for predicting epidemics. This
approach is well suited for the short-term prediction of infectious diseases, and the accuracy of its
predictions has been widely acknowledged; it may also provide practical support for disease prevention
and policy development [22].
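
To make the ARMA(p, q) recursion concrete, the sketch below generates a series from a constant, autoregressive coefficients, moving-average coefficients, and a given noise sequence. The function name `simulate_arma` and the choice to treat values before the start of the series as zero are assumptions for illustration only; fitting the coefficients to data is a separate step not shown here.

```python
import random


def simulate_arma(c, phi, theta, noise):
    """Generate y_t = c + sum_i phi_i*y_{t-i} + sum_j theta_j*e_{t-j} + e_t.

    phi and theta are lists of AR and MA coefficients; noise is the list of
    disturbance terms e_t. Terms that would reach before the start of the
    series are treated as zero (an assumption of this sketch).
    """
    y = []
    for t, e_t in enumerate(noise):
        ar = sum(p * y[t - i] for i, p in enumerate(phi, start=1) if t - i >= 0)
        ma = sum(q * noise[t - j] for j, q in enumerate(theta, start=1) if t - j >= 0)
        y.append(c + ar + ma + e_t)
    return y


# Deterministic check with a single unit shock at t = 0.
impulse = simulate_arma(c=1.0, phi=[0.5], theta=[0.5], noise=[1.0, 0.0, 0.0])
print(impulse)

# White-noise disturbances with mean 0 and constant variance, as in the text.
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(50)]
series = simulate_arma(c=0.0, phi=[0.6, -0.2], theta=[0.3], noise=white_noise)
print(len(series))
```

Setting `theta=[]` reduces the recursion to an AR(p) model, and setting `phi=[]` with `c=0` reduces it to an MA(q) model.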
Metrics and statistical models used: 

In this study, the following primary metrics were employed, where $X_i$ represents the actual value
for instance $i$ and $P_i$ represents the corresponding prediction [23]. The mean absolute error (MAE)
measures the average magnitude of the errors in a set of forecasts without considering their direction.
It is the average over the sample of the absolute differences between the predictions and the actual
observations, with every deviation given equal weight [24].

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |X_i - P_i| \tag{4}$$


Mean absolute percentage error (MAPE) quantifies the accuracy of a forecasting system. It may be
expressed as a percentage and is computed as the average, over all periods, of the absolute difference
between the actual and predicted values divided by the actual value [25].

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{X_i - P_i}{X_i} \right| \tag{5}$$
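
The two metrics can be sketched in a few lines of plain Python. The function names `mae` and `mape` and the toy values are assumptions for illustration; `mape` here returns a fraction, so multiplying by 100 gives a percentage.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |X_i - P_i|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - p) for x, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction: average of |(X_i - P_i) / X_i|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(x == 0 for x in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((x - p) / x) for x, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual values and predictions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mae(actual, predicted), mape(actual, predicted))
```

Because MAE is in the units of the series while MAPE is relative, the two can differ by many orders of magnitude for a series with large values.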


3. METHOD 

The availability of data on COVID-19 in all its forms, including the number of infections, the
number of recoveries, the number of deaths, the number of people vaccinated, and the public's opinion of
vaccination, has prompted many researchers to use these data. This article analyzes the data on the
number of vaccinated people and develops a model to predict their future number. It also analyzes
people's opinions about the vaccine by dividing the tweets into positive, negative, and neutral. ARIMA
was used to predict the future number of vaccinated people and also to predict the number of negative
tweets about a vaccine. Figure 1 shows the proposed model. After forming the data series for both the
vaccination dataset and the tweets with the same date, i.e. day and month, the TextBlob library is used
to analyze and classify the tweets (positive, negative, and neutral).
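
One possible way to form the synchronized series is sketched below: negative tweets are counted per day and only days present in both data sets are kept. The function name `synchronize`, the input shapes, and the toy data are assumptions for illustration and are not taken from the article's data sets.

```python
from collections import Counter
from datetime import date


def synchronize(vaccinations, tweets):
    """Align daily vaccination totals with daily negative-tweet counts.

    vaccinations: dict mapping a date to the vaccination total for that day.
    tweets: iterable of (date, label) pairs, with label in
            {"positive", "negative", "neutral"}.
    Returns (date, vaccinations, negative_tweets) rows sorted by date,
    restricted to days that appear in both data sets.
    """
    tweets = list(tweets)
    tweet_days = {day for day, _ in tweets}
    negative_per_day = Counter(day for day, label in tweets if label == "negative")
    common_days = sorted(tweet_days & set(vaccinations))
    return [(day, vaccinations[day], negative_per_day[day]) for day in common_days]


# Hypothetical toy data.
vaccinations = {
    date(2021, 1, 1): 1000,
    date(2021, 1, 2): 1500,
    date(2021, 1, 3): 2100,
}
tweets = [
    (date(2021, 1, 1), "negative"),
    (date(2021, 1, 1), "positive"),
    (date(2021, 1, 2), "neutral"),
    (date(2021, 1, 4), "negative"),  # no vaccination record for this day: dropped
]
rows = synchronize(vaccinations, tweets)
for row in rows:
    print(row)
```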


[Figure 1: flowchart with the blocks vaccination tweets dataset, sentiment analysis, vaccinated data, and prediction by ARIMA]

Figure 1. Method of the proposed model


3.1. Prediction for vaccinations 

For the number of vaccinations, ARIMA is used for prediction. Figure 2 shows the total number of
Pfizer vaccinations in the top 10 countries from 12/12/2020 to 23/11/2021. Figure 3 shows the
decomposition of the vaccination time series. From this analysis, we may infer that there is an upward
trend in overall vaccinations. Therefore, this time series is non-stationary, and based on the seasonal
component, we may conclude that the model is additive, since the seasonal component remains constant
(i.e., it is not multiplied) over time. Automated model selection reports "Best model: ARIMA (2,2,1)
(0,0,0) [0]", that is, it determines seasonal ARIMA (SARIMA) (2,2,1) to be the optimal model based on the
Akaike information criterion (AIC). Prob(Q)=0.82>0.05, so we do not reject the null hypothesis that the
residuals are uncorrelated; the model's residuals do not appear to exhibit correlation.
Prob(JB)=0.00<0.05, so the Jarque-Bera test rejects the null hypothesis that the residuals are normally
distributed. By this test alone, the residuals depart from a normal distribution, so the diagnostic plots
in Figure 4 are also examined.


Predicting COVID-19 vaccinators based on machine learning and sentiment ... (Hadab Khalid Obayes)


1652   ISSN: 2302-9285
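
The selected order (2,2,1) has d = 2, which means the series is differenced twice before the autoregressive and moving-average terms are fitted. The sketch below illustrates, under the assumption of a toy quadratic trend, how repeated differencing removes a trend; the helper name `difference` is hypothetical and is not part of any library used in the article.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: z_t = y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# A toy cumulative series with a quadratic upward trend (y_t = t^2).
cumulative = [t * t for t in range(6)]
print(difference(cumulative, d=1))  # still trending upward
print(difference(cumulative, d=2))  # constant: the trend has been removed
```

Each round of differencing shortens the series by one observation, so a twice-differenced series of n points has n - 2 values.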


[Figure 2 plot: total vaccinations of Pfizer for top ten countries; y-axis scale 1e10]

Figure 2. The total vaccinations in top 10 countries

Figure 3. The ARIMA decomposition


We must make sure that our model's residuals are uncorrelated, normally distributed, and zero-mean.
If not, the model can be improved further, and we repeat the same procedure using the residuals.
Figure 4 illustrates the distribution of the residuals. Based on the following, our model diagnostics
indicate that the model residuals are approximately normally distributed:
- The KDE plot of the residuals in Figure 4(b) closely resembles the normal distribution.
- The ordered distribution of residuals (blue dots) in Figure 4(c) follows the linear trend of samples
  drawn from a standard normal distribution N(0, 1). Again, this strongly suggests that the residuals
  are normally distributed.
- The residuals over time (Figure 4(a)) do not show any discernible seasonality and appear to be noise.
  This is supported by the autocorrelation plot (also known as a correlogram) in Figure 4(d), which
  shows that the time series residuals have low correlation with their lagged counterparts.
These findings, along with the absence of spikes outside the insignificant zone of the correlogram, lead
us to believe that the residuals are random and carry no remaining information, and that our model
provides a satisfactory fit that could aid in understanding the time series data and forecasting future
values. Figure 5 shows the forecasting result for the next 90 days for vaccinations in the world.

Figure 4. Residual plots (a) standardized residual for "t", (b) histogram plus estimated density,
(c) theoretical quantiles, and (d) autocorrelation

Figure 5. Forecasting 90 days ahead


3.2. Prediction for tweets 

At this stage, the tweets that were written during the vaccination period and whose dates match
the dates of the vaccinations are processed. Figure 6 shows the number of tweets with dates corresponding
to the dates of vaccinations. Automated model selection reports "Best model: ARIMA (6,0,0) (0,0,0) [0]"
with a total fit time of 3.485 seconds, that is, it chooses SARIMA (6,0,0) as the best model based on
AIC. Prob(Q)=0.00<0.05, so the null hypothesis that the residuals are uncorrelated is rejected; the
residuals are correlated. Prob(JB)=0.00<0.05, so the null hypothesis that the residuals are normally
distributed is rejected; the residuals are not normally distributed.

Figure 6. The tweets number
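
To illustrate how a Jarque-Bera probability is read, the sketch below computes the statistic from sample skewness and kurtosis and converts it to a p-value with the asymptotic chi-square distribution with two degrees of freedom, whose survival function is exp(-x/2). The function name `jarque_bera` and the toy residuals are assumptions for illustration; this is not the code used to produce the values reported above.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a sample of residuals.

    JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the
    sample kurtosis. Under normality JB is asymptotically chi-square with two
    degrees of freedom, so the p-value is exp(-JB / 2).
    """
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    if m2 == 0:
        raise ValueError("residuals have zero variance")
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)


# Toy residuals taking only two values: symmetric but far from Gaussian.
toy_residuals = [-1.0, 1.0] * 50
jb_stat, p_value = jarque_bera(toy_residuals)
# A p-value below 0.05 means the normality hypothesis is rejected.
print(jb_stat, p_value, p_value < 0.05)
```

The decision rule is the same as for Prob(Q): a probability below 0.05 rejects the null hypothesis, which for the Jarque-Bera test is that the residuals are normally distributed.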


4. RESULTS AND DISCUSSION 

The residuals of an ideal model should consist of uncorrelated white Gaussian noise centred at zero.
By studying the above charts with this in mind, we can determine whether or not our model is adequate.
For in-sample forecasting, we use the get_prediction method, with the last days of the training data as
validation data. Then, we use mean_absolute_error and mean_absolute_percentage_error from
sklearn.metrics to calculate MAE and MAPE for the model. Table 1 displays the metric values used to
assess the proposed model.
Figure 5 shows the forecasting result for the next (90) days ahead for the vaccinations in the world. 

The model for the tweets is evaluated in the same way. After studying the above charts with this in
mind, it can be determined whether the model is adequate. For in-sample forecasting, the last days of the
training data are used as validation data for the prediction method. Then, the mean absolute error and
mean absolute percentage error metrics from sklearn are used to calculate MAE and MAPE for the model.
Table 2 displays the metric values used to assess the proposed model. Figure 7 shows the prediction
result for the next 90 days for the negative tweets.


Table 1. The metrics value for the total number of vaccinations
Metric    SARIMA (2,2,1)
MAE       1.478688e+07
MAPE      4.900000e-02

Table 2. The metrics value for the number of tweets
Metric    SARIMA (2,2,1)
MAE       1.0221
MAPE      0.919
Figure 7. The prediction of 90 days ahead for the negative tweets

Post analysis: the predicted numbers of vaccinated people increase steadily. This shows that there
is more demand for the vaccine, which suggests that people's confidence in the vaccine began to
increase. This is supported by the predicted number of negative tweets, which decreases steadily with
time. It means that there is an inverse relationship between the number of vaccinated people and the
number of negative tweets about the vaccine.
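
One way such an inverse relationship could be quantified is with the Pearson correlation coefficient between the two synchronized series, where a value near -1 indicates that one series falls as the other rises. The sketch below is an illustration under assumed toy data and is not a computation reported in this article; the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least two")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily values: vaccinations rise while negative tweets fall.
vaccinated = [10.0, 20.0, 30.0, 40.0]
negative_tweets = [8.0, 6.0, 4.0, 2.0]
r = pearson(vaccinated, negative_tweets)
print(r)
```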


5. CONCLUSION 

This article introduced ARIMA models and their variants: SARIMA, and ARIMAX, which employs external
data (exogenous inputs) to enhance the performance of the ARIMA model. The Box-Jenkins approach was used
to identify the optimal model for a portion of the data set (the time series of vaccinations). As a first
step, important time series features, such as stationarity and seasonality, are identified. Once an
acceptable model has been identified, it is used to forecast in sample, i.e., it is applied to a subset
of the training data used as validation data. The forecast is then extended 90 days beyond the sample
period. The findings demonstrated a high degree of accuracy (a very low error rate) for the proposed
model, which is encouraging for the future use of this model in forecasting.